Lithuanian Continuous Speech Corpus LRN 1: An Improvement
Abstract
This paper presents the development of the Lithuanian continuous speech corpus LRN 1 (Lithuanian Radio News, version 1). The corpus was developed from the speech corpus LRN 0.1 by extending its duration to 20 hours 50 minutes. The major improvement in LRN 1 is the addition of time-aligned word-level annotations of the speech signals. These annotations were obtained in a two-stage process: automatic realignment of acoustic models of phonemes, followed by manual correction of the annotations. The improved corpus is useful for constructing and evaluating speaker-independent continuous speech recognition systems and for linguistic research.
Similar resources
Lithuanian Continuous Speech Corpus LRN 0.1: Design and Potential Applications
This paper presents the design, development, and contents of the Lithuanian continuous speech corpus LRN 0.1 (Lithuanian Radio News, prototype version 0.1). The corpus contains 17 hours 23 minutes of recordings from radio broadcast news read by 31 speakers. The recorded material is segmented into sentence-length recordings that are divided into training, development, and evaluation sets. Speech recordings are...
Towards Acoustic Modeling of Lithuanian Speech
In this paper we present an experimental investigation of various phone sets for acoustic modeling of Lithuanian speech, applied to large-vocabulary continuous speech recognition. The paper presents specifics of Lithuanian speech acoustics, including accentuation, diphthongs, and the softening and assimilation of consonants. The speech recognition experiments use only acoustic model since effective langua...
Corpus-Based Hidden Markov Modelling of the Fundamental Frequency of Lithuanian
This paper presents a corpus-driven approach to building a computational model of fundamental frequency (F0) for the Lithuanian language. The model was obtained by training the HMM-based speech synthesis system HTS on six hours of speech from multiple speakers. Several gender-specific models, using different parameters and different contextual factors, were investigated. The models we...
From speech corpus to intonation corpus: clustering phrase pitch contours of Lithuanian
This paper presents our research in preparation to compile a Lithuanian intonation corpus. The main objective of this research was to discover characteristic patterns of Lithuanian intonation through clustering of pitch contours of intermediate intonation phrases. The paper covers the set of procedures that were used to extend an ordinary speech corpus to make it suitable for intonation analysi...
Framework for Choosing a Set of Syllables and Phonemes for Lithuanian Speech Recognition
This paper describes a framework for making up a set of syllables and phonemes that subsequently is used in the creation of acoustic models for continuous speech recognition of Lithuanian. The target is to discover a set of syllables and phonemes that is of utmost importance in speech recognition. This framework includes operations with lexicon, and transcriptions of records. To facilitate this...